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Abstract 


Pulsar search is always the basis of pulsar navigation, gravitational wave detection and other research topics. 
Currently, the volume of pulsar candidates collected by the Five-hundred-meter Aperture Spherical radio 
Telescope (FAST) shows an explosive growth rate that has brought challenges for its pulsar candidate filtering 
system. Particularly, the multi-view heterogeneous data and class imbalance between true pulsars and non-pulsar 
candidates have negative effects on traditional single-modal supervised classification methods. In this study, a 
multi-modal and semi-supervised learning based on a pulsar candidate sifting algorithm is presented, which adopts 
a hybrid ensemble clustering scheme of density-based and partition-based methods combined with a feature-level 
fusion strategy for input data and a data partition strategy for parallelization. Experiments on both High Time 
Resolution Universe Survey II (HTRU2) and actual FAST observation data demonstrate that the proposed 
algorithm could excellently identify pulsars: On HTRU2, the precision and recall rates of its parallel mode reach 
0.981 and 0.988 respectively. On FAST data, those of its parallel mode reach 0.891 and 0.961, meanwhile, the 
running time also significantly decreases with the increment of parallel nodes within limits. Thus, we can conclude 
that our algorithm could be a feasible idea for large scale pulsar candidate sifting for FAST drift scan observation. 


https: / /doi.org/10.1088/1674-4527 /ad0c28 
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1. Introduction 


So far, a lot of radio pulsars have been discovered by modern 
pulsar surveys, including the High Time Resolution Universe 
(HTRU, Burke-Spolaor et al. 2011) Parkes survey, Low- 
Frequency Array (LOFAR) Tied-Array All-Sky Survey 
(LOTASS, Coenen et al. 2014), Commensal Radio Astronomy 
FasT Survey (CRAFTS, Jiang et al. 2019; Wang et al. 2021), 
Galactic Plane Pulsar Snapshot (GPPS, Han et al. 2021), etc. 
Compared to previous ones, modern surveys tend to use more 
sensitive detection techniques, greater survey areas, improved 
data analysis techniques and collaborative efforts, such as 
CRAFTS. They often produce a large number of potential 
pulsar candidates, but only a very small proportion of these 
candidates (nearly one in ten thousand) are identified as real 
pulsars for the reason of large amounts of interference signals. 
Thus, it is critical to reduce the retention of a large number of 
non-pulsar signals without loss of pulsar-like samples. At 
present, this issue can be alleviated on two research points of 
pulsar search pipelines. (i) Signal processing: remove Radio 
Frequency Interference (RFI) signals from a large amount of 


observational data as much as possible (Morello et al. 2014; 
Yang et al. 2020) and optimize some important parameters 
such as signal-to-noise ratio (SNR) detections and so on. (ii) 
Candidate selection: minimize the labor of further observations, 
which focuses on filtering the pulsar-like samples among large 
numbers of candidates automatically and accurately by 
advanced artificial intelligence techniques. This paper involves 
the latter. 

Generally, existing pulsar candidate selection methods based 
on artificial intelligence could be classified into three categories, 
according to the principles of these methods. The traditional 
scoring methods are regarded as the first category, e.g., Pulsar 
Evaluation Algorithm for Candidate Extraction (PEACE, Lee 
et al. 2013). The second category refers to Machine Learning 
(ML) based classifiers that usually perform better than the first 
category, e.g., Morello et al. (2014), Lyon et al. (2016), Tan 
et al. (2018), etc. More recently, Burdwan (2019) applied the 
eight features designed by Lyon et al. (2016) to test the 
performance of the Random Forest (RF) algorithm, K-Nearest 
Neighbors (KNN) algorithm and Logistic Regression (LR) 
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algorithm. Xiao et al. (2020) designed a reliable KNN based 
model named Pseudo-Nearest Centroid Neighbor (PNCN) 
classifier for pulsar survey data streams, which can effectively 
deal with the class imbalance problem. In these methods, the 
features used often depend heavily on human experience, whose 
classification performance may be adversely affected. For 
example, some classifiers only extract features from the 
Dispersion Measure (DM) and pulse profile curve, which lead 
to some RFIs being identified as pulsars incorrectly. As the radio 
environment is becoming more complex, it is more difficult to 
effectively distinguish pulsar candidates and non-pulsar candi- 
dates only by statistical features. In practice, the pulsars can be 
successfully identified just by human experts observing the 
corresponding diagnostic plots. Based on this inspiration, the 
third category utilizes diagnostic plots as the inputs of image 
recognition models or multi-method ensemble models including 
image recognition, so that the "pulsar-like" mode can be learned 
from the diagnostic sub-graph automatically by training Deep 
Learning (DL) based models, e.g., Wang et al. (2019), Guo et al. 
(2019), Zeng et al. (2020), Zhang et al. (2021), etc. Compared 
with the ML based models in the second category, these methods 
have better generalization ability. Among them, the methods 
described in Wang et al. (2019) and Zeng et al. (2020) are 
mainly used in the pulsar search pipeline of the Five-hundred- 
meter Aperture Spherical radio Telescope (FAST) survey. Wang 
et al. (2019) presented a new ensemble classification system on 
FAST for pulsar candidate selection that is composed of five 
classifiers. Furthermore, it was incorperated in the development 
of the Pulsar Image-based Classification System (PICS) (Zhu 
et al. 2014). Zeng et al. (2020) designed an end-to-end online 
learning model, namely Concat Convolutional Neural Network 
(CCNN), to identify the candidates without any intermediate 
labels which are processed from FAST data. The aforementioned 
methods are mostly based on a single mode. At present, there is 
little literature about multi-modal methods applied to astronom- 
ical data mining, especially pulsar identification. Zhang et al. 
(2021) proposed an early fusion based pulsar image identifica- 
tion framework with smart under-sampling, which was evaluated 
on the HTRU Medlat data set. In this work, a semi-supervised 
learning and Feature-level Multi-modal Fusion based Hybrid 
Clustering (FMFHC) scheme is designed for large scale 
candidate sifting through FAST pulsar search pipelines, in terms 
of Wang et al. (2019) and Ma et al. (2022). 

A typical pulsar search pipeline for FAST drift scan 
observed data roughly includes the following several steps: 
() eliminating the obvious interference signals from the 
original data; (ii) dedispersing the data into time series with 
distinct DM values; (iii) performing a fast Fourier transform on 
each time series so as to further search for periodic signals; (iv) 
sifting these periodic signals and outputting candidates 
obtained to files (e.g., suffix pfd); (v) folding the data in a 
periodic manner and then outputting candidate images. In 
practice, there are still a lot of non-pulsar candidates in step 
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(v), including RFI. Our algorithm is introduced into the sifting 
stage of step (iv), aiming to further address the following issues 
in the pulsar candidate selection area: 


(i) For some aforementioned methods (e.g., the supervised 
ML based candidate signal classifiers and DL based 
diagnostic subplots recognition models), the cost of 
obtaining a large amount of labeled data (in which the 
proportion between the real pulsar and the non-pulsar 
sample is extremely imbalanced) and periodic training (to 
avoid overfitting and underfitting) is too high, and these 
methods are all based on binary classification. 

(i1) In the process of pulsar search, there are usually multi- 
view heterogeneous candidate data, which contain 
various types and attributes. In practical applications, it 
could be difficult to further mine the deep features hidden 
in these data through a single modal candidate selection 
algorithm. 


The rest of this paper is organized as follows: Section 2 
describes the pulsar candidate features and similarity measure 
involved. Section 3 presents the components of the overall 
algorithm in detail. In Section 4, the experimental data sets, 
data pre-processing methods and results are illustrated. The 
discussion is in Section 5. Finally, in Section 6, we present the 
conclusion and future work. 


2. Pulsar Candidate Features and Similarity Measure 
2.1. Pulsar Candidate Features 


The extraction of candidate features is very important to 
maximize the separation between non-pulsar and pulsar 
candidates. We assume the pulsar candidates were processed 
by the software pipelines based on PulsaR Exploration and 
Search TOolkit (PRESTO, Ransom 2011; Yue et al. 2013), 
which implemented similar search steps for advanced telescope 
systems such as FAST. 

In terms of statistical features, there are eight new features 
extracted from the pfd files by the feature extraction program 
(Lyon et al. 2016), including the mean value, excess kurtosis, 
standard deviation and skewness of the pulse profile, and the 
mean value, excess kurtosis, standard deviation and skewness 
of the DM-SNR curve. Note that the first four statistics 
correspond to the integrated pulse profile, and the remaining 
four correspond to the DM-SNR curve. These features were 
chosen to maximize the separation of various candidate 
classes when used together with an ML classifier. Further- 
more, the High Time Resolution Universe II (HTRU2) data 
set was used during the work Lyon et al. (2016). In terms of 
diagnostic plots, most features of a candidate signal can be 
visualized through different diagnostic subplots, including a 
folded profile plot, sub-integrations plot, sub-bands plot, 
DM-SNR curve, etc. The ensemble model based on PICS 
(Wang et al. 2019) can also extract four main feature plots of 
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a candidate from the pfd files, which are one-dimensional 
(1D) data array summed profile and DM curve, two- 
dimensional (2D) data array time versus phase (TVP) and 
frequency versus phase (FVP). Note that the size of 1D 
feature plots is 64 x 1, while the size of 2D feature plots is 
64 x 64. The experiments implemented on the FAST pulsar 
survey data (Wang et al. 2019) demonstrate that PICS- 
ResNet can achieve a higher recall rate of 98% than PICS 
(which is 95%). 


2.2. Similarity Measure 


A candidate believed to be a real pulsar must have very 
similar statistical features (e.g., standard deviation and skew- 
ness of pulse profile) and diagnostic subplots (e.g., 2D FVP) 
with some other known pulsars. That is the reason we plan to 
design a multi-modal clustering algorithm for large amounts of 
pulsar candidate data. Figure 1 displays an example of the 
features and plots. It can be seen that known pulsars J0358 
+5413 and J1915--1606 are very similar in the integrated pulse 
profile and FVP diagram. Similarly, interference signal I and 
interference signal II are also very similar in the integrated 
pulse profile and FVP diagram. The detailed information on 
these plots is described in Table 1. The feature similarity 
between candidates will be further validated in the exper- 
imental results of Section 4.2, as shown in Figure 8. 


3. The Method 
3.1. Feature Fusion Strategy 


Through fusing different features of multiple modalities 
extracted from a single candidate, feature fusion methods can 
further refine the features with higher discrimination. Among 
them, Discriminant Correlation Analysis (DCA) is a linear 
method which maximizes the pair-wise correlation between 
two feature sets and possesses very low computational 
complexity at the same time. 

Depending on the DCA algorithm, our objective is pre- 
processing candidate data extracted from pfd files by fusing 
features from different modalities of the same object before 
feeding them to a hybrid clustering scheme as input data. 
According to Section 2.1, these modalities include a 1D data 
array (statistical feature format on HTRU2) and 2D arrays 
(FVP and TVP formats on PICS), as illustrated in Figure 2. The 
DCA algorithm is able to establish the correlation criterion 
between the two groups of feature vectors, to extract their 
canonical correlation features. In this work, it is assumed that N 
training samples are collected from two classes, which are (0: 
non-pulsar, 1: pulsar). For each sample, two feature vectors 
with 8 and 64 dimensions are extracted from two modalities, 
which are 1D statistical features extracted from the feature 
extraction program of HTRU2 and 2D feature plots extracted 
from PICS. Then, X= R?*" and Y= RÓ^*" denote the data 
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matrices containing the two feature sets. The X template is 
composed of N 1D vectors (8 x 1) which have the same format 
as that of HTRU2. Furthermore, the Y template is generated by 
extracting N feature plots (TVP or FVP or fusion of both, 
64 x 64) using the unsupervised Convolutional Auto-Encoder 
(CAE) network as depicted in Figure 3. Note that each feature 
plot is dimensionally reduced to 8 x 8 through CAE and then 
resized to 64 x 1 by the reshape method in the Python toolkit. 
So, the dimension of Y is defined as 64 x N. 

The fine tuned CAE architecture we used is shown in 
Figure 3. Then, after model training (where Epochs—50, 
Batch size—128, the optimizer is Adadelta and the loss 
function is Binary crossentropy) and testing, the loss rate 
was maintained at around 3546, which can make the model 
retain most of the main features of TVP or FVP while ensuring 
runtime efficiency. By using this CAE, the entire clustering 
algorithm performed well in the subsequent experiments in 
Sections 4.1 and 4.2. Meanwhile, the process of the DCA 
method is described as follows: 


(i) The N columns of both data matrix are divided into c 
separate groups. Let Sp, be the between-class scatter 
matrix defined in Equations (1) and (2): 


[4 
NET eee T 
S bx so) = b» ni (xi mE x); m x)T = $5 io P exp? (1) 
i-l 
Oc) = [Ca ~ x), n2 (xi m Ke) ses) nc (xe — x)]. (2) 
where x; € X denotes the p dimensional feature vector of 
the jth sample in the ith class, x; represents the mean of 
the ith class and x signifies the whole feature set mean. 


Gi) Find transformation matrix W,, to transform the corresp- 
onding feature matrix Xgxy to X/,. y as follows: 


d T 
Xoxn) = Wy cox 8) X8 xn)» (3) 


Six = Wyre Spx Wpx = ®,,(®,,)" =, (4) 


where d; (5, )" is a strictly diagonally dominant matrix, 
which represents the correlation between the ith class and 
the jth class. Similar to Xs. , find transformation matrix 
Wpy to unitize the between-class scatter matrix Spy , which 
transforms Yg4..w to P N- 

(iii) Maximize the pair-wise correlation across the feature sets 
X' and Y'. This requires the between-set covariance 
matrix (Sy, = X'Y'T) to be diagonal through singular 
value decomposition (SVD) as follows: 

SL = UEV" SUSY SVE TSE (5) 


Norn 


The transformation matrices for X’ and Y’ are: 


Wax, (6) 


1 


Wy = Vx. (7) 
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Figure 1. An example of the similarity between candidates. Note that (a) is the diagnostic plots of known pulsar J1906--0746 and (b) is those of known pulsar J1807- 
0847; (c) is the diagnostic plots of interference signal I and (d) is those of interference signal II. 


Consequently, the final transformed feature sets are: 


X* = We Wir, (8) 


Y* = WLWLY. (9) 
(iv) The early feature-level fusion is performed by summing 
the final transformed feature sets, that is Z= X" + Y, 
where Z called the Canonical Correlation Discriminant 


Features (CCDFs). 


In this way, the data samples with fused features can be 
inputted into the hybrid clustering model in the next step. 


3.2. Hybrid Clustering 


Aiming to classify data into various homogeneous clusters in 
terms of their similarities, clustering methods are divided into 
different categories including density-based methods, partition- 
based methods and so on. As a representative partition-based 
clustering, K-Means is widely applied (Krishna & Narasimha 
Murty 1999). Due to its drawbacks (it is only applicable to 
distinguishing clusters with a hyper spherical data distribution 
and the clustering results are usually sensitive to the parameter 
settings), extended versions have been presented such as Arthur 
& Vassilvitski (2007) and Nguyen (2018). In addition, the 
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Table 1 


The Detailed Information on the Plots in Figure 1 
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Parameter / Name 


Telescope 
Candidate 
Epochtopo 
Epochpary 

Data folded 
Data Avg 

Data StdDev 
Profile Bins 
Profile Avg 
Profile StdDev 
DOFeff 
DM(pc/cm?) 
Ptopo(ms) 
P'topo(s/s) 
P'topo(s / s?) 
Poise 
Ppary(ms) 
Ptopo(ms) 
P'topo(s/s) 
P'topo(s / s?) 
R.A.52000 

Decl. 32000 
Receiver Name 
Antenna Gain 
Bandwidth 
Temperature 
Terminal Name 
Sampling Rate 
Polarization channels 
Number of channels 


J0358+5413 


FAST 
156.38 ms_cand 
59435.95677812153 
59435.95439208478 
2620416 
2.387e4-04 

332.9 
64 
9.772e4-08 
6.737e+04 
59.53 
57.727 
156.3697468 
4.602(20) x 1078 
0.0(5.0) x 10? 
0(6207.20) 
156.3274369(66) 
156.3697468 
4.604(20) x 1075 
0.0(5.0) x 10? 
12:34:56.7890 
—12:34:56.7890 
19 beam receiver 
16.1k Jy-1 
500 MHz 
20K 
psr 
49.152 us 
4 
4096 


J1915+1606 


FAST 
59.03 ms_cand 
58784.35905500046 
58784.35452882630 
1309696 
1.103e+05 
569.7 
64 
2.257e4-09 
8.15e+04 
60.31 
168.770 
59.027867 
0.0(1.0) x 1078 
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4096 


Density Peaks Clustering (DPC) algorithm adopted density 
peaks as features to quickly discover the potential cluster 
centers without any prior knowledge through drawing a 2D 
decision graph. More recently, another density-based algorithm 


using global and local consistency adjustable manifold distance 
(McDPC) was proposed by Wang et al. (2020b) to address the 
drawback (that it is easy to divide the same cluster into multiple 


microclusters corresponding to its multiple high-density points) 
of Clustering by Fast Search and Find of Density Peaks 
(CFSFDP, Rodriguez & Laio 2014). FMFHC is developed as a 
fusion of clustering methods based on DPC and K-Means. 
The clustering process of FMFHC is summarized in 
Figure 4, in which the main steps are summarized as follows: 


(i) To better adapt to different data structures, the k-nearest 
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FAST 
JERK _ cand 30 
58419.36111111111 
58419.36110676074 
12057600 
1.48e--05 
782.1 
64 
2.787e+10 
3.395e+05 
59.00 
203.538 
17.3762014 
2.3313(78) x 10-8 
0.0(8.6) x 10- P? 
«4.18e — 09(5.80) 
17.3762014(60) 
17.3762014 
2.3313(78) x 1078 
0.0(8.6) x 107"? 
12:34:56.7890 
—12:34:56.7890 
19 beam receiver 
16.1k Jy-1 
500 MHz 
20K 
psr 
49.152 us 
4 
4096 


Equations (10) and (11). 


Kapr_poly (Xi, Yj) = A exp | = 


Interference Signal II 
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neighbor based mixed kernel function of a Radial 
Basis Function (RBF) and Polynomial (RBF Poly) is 
used to calculate the density values of fused input 
data (mentioned in Section 3.1), as expressed in 


3:0 = Dawa 3) hs: 
q 21, A€ [0,1] 


where Kgpgr poiy(X;, yj) represents the mixed kernel 
function of data points x; and x;, A signifies the weight 
of the RBF function, o is the width of the RBF function 
and q is the order of the polynomial function. Although 
c, q and A cannot significantly improve the clustering 
effect, they can make the mixed kernel distance based 
similarity calculation more stable for multiple shapes of 
the data distribution if the A value is reasonable. The 
values of these three parameters are determined (c = 1, 
q— 2, A — 0.95) by past experiences since there is no 
analytical method for the selection of fusion coefficients 
currently, as mentioned in Wang & Xu (2017) and 
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Figure 2. Diagram of the feature fusion model. 
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Figure 3. Network architecture of original CAE. Note that the encoding module of CAE has four hidden layers as follows. The first layer: one 2D convolution with 
(3 x 3) x 32 (output channels) and one 2D max-pooling with 2 x 2; the second layer: one 2D convolution with (3 x 3) x 16 and one 2D max-pooling with 2 x 2; the 
third layer: one 2D convolution with (3 x 3) x 8 and one 2D max-pooling with 2 x 2; the fourth layer: one 2D convolution with (3 x 3) x 1. The following value 
was used: the Input size, 64 x 64 x 1; kernel, 3 x 3 pixels; padding, 1 x 1 pixels; the Encode output size, 8 x 8 x 1. 


Huang et al. (2013). of data point x; is defined as Equation (12) 


6; = i S i» Aj)> 12 
Pi = > Knnr. Poly (xi xj) x ( 1 1) Pu Mah (x, xj ( ) 
xjeKNNG; 


where Syan(x;, xj) signifies the Mahalanobis distance 
where p; denotes local density of data point x; and between x; and x; In addition, the density threshold 
KNN(x;) refers to the KNN of x;. Further, parameter 6; Poutlier iS used for extracting outliers from the whole 
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Figure 4. Flow chart of hybrid clustering scheme. 


data set, some of which may be special pulsars that need 
to be further investigated. An improved K-dist 
graph method is employed to determine Poutiier (Wang 
et al. 2020b). In the K-dist graph, outliers often are 
located at the leftmost local density level named PL, 
which comprises edge points of natural clusters with 
lower p and lower 6 values and outliers with lower p 
and higher ó values. To further distinguish which data 
points in PL should be considered as outliers, Pouttier 1S 
determined from Equation (13) 


(&: PL,...,PL;), Dx P > Poutlier 
{outc: PLj+1,...,PLy}, [E X. Poutlier 


PL 2 


» (13) 


where PL; denotes a data point in PL, outc means the set 
of outliers and outc is set in an autonomous manner, 
i.e., Vx; € oute , Vx; € E , 5p, S Po 


(ii) The improved cluster center selection scheme of DPC is 


used with automatic determination of the number of 
clusters and their center points. All the values of p; and 6; 
(x;  outc) are used for generating the 2D decision 
graph which helps to select the initial cluster centers 


24:035022 (14pp), 2024 March 


You et al. 


Decision Diagram Ô; irehold 


Figure 5. 2D decision graph based on p; and ô;. 


automatically. In the derived decision graph as shown in 
Figure 5, the parameters Pthrehoia aNd Othrenola are reason- 
ably set as the truncation threshold to form a rectangle 
with a red border, in which all data points are selected as 
representative points (i.e., initial cluster centers). More- 
over, the gap area with the yellow background separates 
these representative points from other points. Note that 
the representative points are also the multi-density center 
points, and the rest of the data points will be allocated to 
these center points to form intermediate microclusters. 
This representative point selection scheme is similar to 
that in the McDPC algorithm, which can divide the 
remaining samples into different density levels. However, 
it has better generalization performance and can identify 
more multi-density data sets compared to McDPC. 


(iii) After the number of clusters k and the initial cluster 


centers are determined, the improved iterative optim- 
ization scheme of cluster centers of K-Means is used for 
all data point regroupings and final convergence. The 
distance between any point x; (x; ¢ outc) and each cluster 
center in the current iteration is calculated based on an 
RBF kernel function. Note that RBF can improve the 
similarity measure between two points by mapping from 
measured distances to a high-dimensional space. More- 
over, starting from the 2nd iteration, a weighted distance 
optimization is adopted for similarity measure as shown 
is Equations (14) and (15) 


Max p — Min p (14) 


NEW _Prenterj = Max = 


P center; 


Here new_fcenter; denotes the weight value of cluster 
center center; used for distance optimization, and Max_p 
and Min_p signify the maximum and minimum density 
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Figure 6. Data partition based on sliding window. 


of data points not in outc respectively. 


Snew (xi, center;) 


_ FC exp || xi— = I )) " (new. pua, (15) 


where S$,4,05, center; signifies the weighted distance 
between xj(x;  outc) and center; As a result, all the 
clusters in the current iteration and corresponding cluster 
centers are updated. The weighted distance makes data 
points move closer to cluster centers with relatively 
smaller density nearby, which is conducive to determin- 
ing cluster boundaries. 


For each iteration, the Sum of Squares of Errors (SSE) for all 
the points are updated. Then, the K-Means iterative process 
will enter the next update cycle of cluster centers until the SSE 
value has little change compared with the previous round. 
Finally, all the data points except outc are assigned to the k 
clusters. 

The Parallel Hybrid Clustering Analyzer (PHCAL, Ma et al. 
2022) combines the advantages of the McDPC and K-Means 
algorithms to ensure the stability and depth of data mining for 
pulsar candidates. Compared to PHCAL, the clustering process 
of FMFHC can improve the flexibility of determining initial 
cluster centers on data sets with an irregular shape distribution 
and get a more stable clustering and outliers. 


3.3. Data Partition Strategy 


Statistics show that the FAST 19 beam receiver can provide 
more than a million candidates per night, as mentioned in Liu 
et al. (2021) and Yin et al. (2022). To improve the time 
performance of FMFHC, it is essential to examine the parallel 
implementation of FMFHC based on the models, e.g., Message 


Passing Interface (MPI), SparkCore, etc. For this reason, the 
sliding window based data partition strategy for candidate data 
streams (Ma et al. 2022) is adopted on the basis of data 
structure. As can be seen in Figure 6, the window size of each 
round is fixed to Batchsize — w. Then, a relatively complete set 
is formed by selecting appropriate samples from the actual 
pulsars of various types, which will be added to the block to be 
detected (shadow areas) at a specific ratio (v: w) in each round. 
According to the clustering results in every block, the clusters 
whose pulsar sample proportion is greater than a certain 
threshold (e.g., 5096) will be regarded as pulsar data areas and 
entered into a unified list for further validation. In addition, it 
should be determined if the outliers screened out before 
clustering are special pulsars. The sliding window mode 
enables each data sample to appear in two or more blocks with 
multiple data distributions, so it becomes possible for some 
data points classified incorrectly in some blocks to be identified 
correctly in other blocks. Note that the specific parallelization 
schemes are not discussed in this work. 


3.4. Time Complexity Analysis 


As a combination of clustering methodology of density- 
based and partition-based methods, the serial clustering process 
of FMFHC is more complex than the K-Means and DPC 
algorithms. However, this deficiency can be made up as much 
as possible by using a reasonable parallelization method. 
Table 2 shows the time complexity of a serial clustering 
process of FMFHC and compares it with three other common 
serial algorithms, i.e K-Means++ (Arthur & Vassil- 
vitski 2007), McDPC (Wang et al 2020b) and KNN 
(Peterson 2009). 

Let the total number of samples of a data set be n, then: (1) 
The time complexity of K-Means+-+ is O(nkTM), which can 
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Table 2 
Time Complexity Statistics of FMFHC and Other Serial Algorithms 
Algorithm Parallel Mode of FMFHC Serial Mode of FMFHC K-Means++ McDpc KNN 
Time Complexity limg( 1 O(G(p)m)*) O(n’), if 35. ., C; < threshold, O(nKTM) O(n’) O(nM + nD^) 


Note. T: the number of iterations; M: the characteristic number of elements; m: the number of samples of Block(i); k: the number of cluster centers; D: the nearest 


neighbor parameter. 


be simplified to O(n) as k, T and M are considered as constants. 
(ii) The time complexity of McDPC based on different density 
levels is O(n”) since the computing complexity of parameters p 
and 6 is O(n”). (iii) The complexity of KNN is O(nM + nD) as 
calculated from the worst case. (iv) The serial time complexity 
of FMFHC without applying the data partitioning strategy 
in Section 3.3 is OG nGL + n? + nKTM), where 
OUT. nC;L) denotes the time complexity of the original 
CAE architecture which has four convolution layers, O(n’) 
signifies the time complexity of the multi-density center 
selection scheme and O(nkTM) represents the time complexity 
of the improved cluster center iterative optimization of 
K-Means. Note that C; is the time complexity of a single 
convolution layer i(1 <i<4) of a sample (that is C; = 
O(V? x we x Cin; x Cout;), where V; denotes the size of 
output feature map of convolution layer i, W; signifies the size 
of convolution kernel of convolution layer i, Cin; represents the 
number of input channels, Cout; corresponds to the number of 
output channels), and L is the number of training iterations. In 
addition, the time complexity of the feature fusion process 
between the 1D feature arrays (8x1) and (64x1) is close to O 
(n) according to Section 3.1, which could be neglected. 

If 577. C; are small enough (i.e., yc C; < threshold) and 
n is large enough (ie. > threshold>), the serial time 
complexity of FMFHC can be simplified to O(n’), where 
E Ci, k, T, M and L are considered as constants. Obviously, 
this is an idealized state and the complexity value is close to 
McDPC but higher than KNN and K-Means+-. In terms of 
this premise, the time complexity of the parallel mode of 
FMFHC can be further discussed when using the sliding 
window based data partition strategy. As a result, the parallel 
complexity of FMFHC is O((G(p)m))) in light of the Sun-Ni 
theorem (Sun & Ni 2002), where G(p) denotes the factor and m 
signifies the number of samples in Block(i) with m «& n. When 
the number of parallel nodes p tends to a certain threshold 
(close to the total number of divided blocks) and the 
communication delay tends to be ignored, the complexity 
value is simplified to O(n?) (G(p) — 1). Therefore, if the 
communication overhead is very low or even negligible, the 
time complexity of the parallel mode of FMFHC is 
significantly lower than that of its serial version in theory, 
which will be verified in practice later. In addition, the speedup 
(Sp) and parallel efficiency (Ep) for the parallel version of 


Table 3 
Basic Information on Both Data Sets used in This Study 
Data Set Samples Pulsars Non-pulsars R* 
HTRU2 17 898 1639 16 259 9.92:1 
FAST data 157 616 78 157538 2019.71:1 


Note. R: the class imbalance ratio of non-pulsars to pulsars. 


FMFHC are defined as follows. 


$ i resno. 
2 B T, E O(n) m«n Ci< threshold; 
"^ T, — \ lime) 10 (G(p)m)) ' 
(16) 
B 
E, =~ (17) 
P 


where T, is the serial running time of FMFHC, and T, is the 
running time of the parallel mode of FMFHC under p parallel 
nodes. In theory, the performance of the parallel mode of 
FMFHC will remain consistent for different data sizes and 
hardware resources, depending on the following two condi- 
tions: (i) The p value is close to a sufficiently big threshold; (ii) 
The ratio of communication delay in total running time is small 
enough to be neglected. 


4. Experiments and Results 
4.1. Datasets and Evaluation Metrics 


Our algorithm was tested on both HTRU2 and FAST data 
sets. HTRU2 is an open telescope data set describing a sample 
of pulsar candidates collected during the HTRU Survey, which 
consists of 16259 non-pulsar samples and 1639 pulsar 
samples. It is widely adopted to evaluate the performance of 
ML based classification algorithms. Lyon et al. (2016) made 
the HTRU2 data set available, which has been uploaded on the 
website." The class imbalance ratio of HTRU2 is 9.92:1. It is 
widely adopted to evaluate the performance of ML based 
classification algorithms. Another FAST data set is obtained 


5 https: //figshare.com/articles/data set/HTRU2/3080389/1 
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Table 4 
Confusion Matrix 


Actual Category Predicted 


Results Positive Negative 
True data True positive (TP) False Negative (FN) 
False data False positive (FP) True Negative (TN) 


from the actual observation data of FAST (CRAFTS). The 
CRAFTS database is uploaded on the website.’ In the FAST 
data set, 157 616 candidates with pfd files were collected from 
the survey, among which 78 were pulsar samples and 157 538 
RFI samples. The class imbalance ratio of the FAST data set is 
2019.71:1. Table 3 shows the basic information on both 
experimental data sets. 

The evaluation metrics adopted for performing candidate 
classification usually are Precision, Recall and Fl-Score. 
Table 4 shows the confusion matrix of the classification. 
Precision means the proportion of actual pulsar samples 
properly classified in all the candidates which are classified 
as positive, and Recall is the proportion of actual pulsars 
correctly classified. Precision and Recall are often inversely 
proportional (when Precision is high, Recall is usually low), so 
Fl-Score can be used to reconcile this pair of metrics. 
Combined with the data partition strategy in Section 3.3, the 
overall performance metrics, i.e., Precisiongyeran, Recalloveran 
and F1 — Scoreoyeray, are defined as follows: 


L 
PrecisiOnoveral = : 5 T5 , (18) 
L\i= (TP; + FP) 
Reval epe dy (19) 
TP; + FN, 


Fl — Scoreoveral = — 


(>: 2 x Precision; x Recall; 
L 


} (20) 


7-1 Precision; + Recall; 
where TP;, FP, and FN; respectively denote the number of True 
Positive, False Positive and False Negative cases in Block(I), 
and L is the total number of divided data blocks. 


UTP 


Recall overall SS aA 
TP + FN 


(21) 


where UTP = TP, U TP, U TP; --- TP} means the union of 
identified pulsar samples in each data block, and TP and FN 
respectively denote the total number of True Positive and False 
Negative cases in entire data set. Note that, once a pulsar 
sample has been correctly identified in a Block(I), it will be 
counted to TP. 


? http: //groups.bao.ac.cn /ism/CRAFTS /202203 /t20220310_683697.html 
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Figure 7. The Pjhrenoia ANd Sthrehola triangle plot. Note that p1 denotes Pyhrehoia 
and p2 denotes Ĉthrehold- 


4.2. Clustering Effect Test using the HTRU2 Data Set 


To validate the clustering result of our model, it has been 
experimented on with HTRU2 and compared with other single- 
modal candidate signal classifiers, including supervised and 
unsupervised learning algorithms implemented on HTRU2 in 
the mentioned literature. The data pre-processing was carried 
out first. According to the multi-modal fusion strategy in 
Section 3.1, all the new candidates in HTRU2 were formed by 
the fusion of the original 1D feature arrays and related 2D TVP 
arrays. Note that the 2D TVP arrays were extracted from other 
selected pfd files with very similar characteristics to corresp- 
onding HTRU2 samples. Moreover, 800 pulsar samples and 
4259 non-pulsar samples were used for DCA algorithm 
training. Next, 1600 of the 1639 real pulsar samples of HTRU2 
were randomly selected as a pulsar set s, while the remaining 
39 pulsar samples were randomly dispersed to the non-pulsar 
samples of HTRU2 to form the data set to be detected. In terms 
of the data partition strategy in Section 3.3, the sliding window 
size was set to Batchsize=2 where the unit-size was 1161, then 
the entire data set to be detected was divided into (f, £5, ---, 
t13, f14) based on unit-size, where fi, 15,...,£533 = 1161 but 
t14 = 1205. Consequently, the experimental data set consists of 
14 data blocks, including (Block(1): [s, tı, t5], Block(2): [s, t», 
t3], ..., Block(13): [s, t13, fj4], Block(14): [s, tj4, t)]}. Each 
Block(i) was clustered separately, and the clusters with higher 
pulsar proportions than a certain value (e.g., > 50%) in the 
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Table 5 
Classification Results with Different Methods on HTRU2 
Classfication Method Precision Recall Fl-Score 
SVM 0.723 0.901 0.789 
Supervised PNCN 0.923 0.831 0.874 
RF 0.958 0.891 0.921 
KNN 0.952 0.875 0.909 
K-Means 0.926 0.747 0.827 
++(k=35) 
McDpc 0.592 0.288 0.388 
Unsupervised /Semi- PHCAL 0.946 0.905 0.881 
supervised 
The parallel 0.981 0.988 0.974 
mode of 
FMFHC 


clustering results were selected into the pulsar-like candi- 
date list. 

The values of Pthrehola and Othrehoia have a significant 
influence on the Fl-Score, so they have an important effect 
on the clustering. During the execution of our algorithm on 
HTRU2, the values of puenoia and Othrenoia Were randomly 
changed over 2000 rounds. Then, the 800 pairs of Pinrenhoia and 
Othrehoia (the corresponding values of Fl-Score were not less 
than 0.8) in the 2000 rounds were selected to make an actual 
two parameter triangle plot, as shown in Figure 7. The fitting 
result shows that there is no interaction between them. 
Moreover, both P¢nrehoia and Othrenola can be fine-tuned by 
using the heuristic methods (Wang et al. 20202). On HTRU2, 
by analyzing the K-dist graph, we plot the respective inflection 
points of p; and 6(x;¢outc) which are easily found. 
Consequently, Pthrehora May be taken from a range of values 
(p; € [0.2, 0.5]) to obtain the best results, and it is the same for 
Óthrenoia (6; € [0.02, 0.075]), and then the values were fine-tuned 
heuristically. For data sets with a single local density level, 
reasonable thresholds for p; and 6; will result in the inclusion of 
all obvious cluster centers, based on the DPC’s assumption that 
data points with higher p and higher 6 values should be selected 
as cluster centers. By such a heuristic method, it is not difficult 
to find a reasonable Pinrehoia Value (Pthrehola = 0.3) and 6 value 
(Othrehold = 0.024). In addition, pog is set to 0.00051 
according to Section 3.2. 

Table 5 shows the classification performance of FMFHC on 
HTRU2, compared with other unsupervised /semi-supervised 
algorithms (including K-Means++, Arthur & Vassilvitski 
2007), McDPC (Wang et al. 2020b) and PHCAL (Ma et al. 
2022) and supervised algorithms (including the RF in 
Burdwan 2019), and the KNN, SVM and PNCN in Xiao 
et al. (2020) implemented on HTRU2. Among all these 
unsupervised /semi-supervised and supervised algorithms, the 
parallel mode of FMFHC has the highest Precision (reaching 
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98.1%), Recall (reaching 98.8%) and Fl-Score (reaching 
97.4%). Moreover, upon executing several rounds of the 
control test in which 39 pulsar samples were randomly selected 
to form the data set to be detected, all the 39 pulsar samples 
could be detected (i.e., 100%) in each round by the parallel 
mode of FMFHC. 


4.3. Robustness Test Using the FAST Data 


The FAST data set was used to further verify the robustness 
and efficiency of FMFHC. Note that the data pre-processing on 
FAST data is very similar to that on HTRU2. At first, to change 
the class imbalance ratio between pulsar and non-pulsar 
samples in a single block, 1600 known pulsar samples from 
multiple types were prepared as a known sample set s’ and 
added to this data set. Some of the pulsar samples in s’ are 
known pulsars searched by CRAFTS during synchronization 
testing, and they can be found in the linked star catalog.'? 
Moreover, the DM range of these pulsar samples is often set to 
[2, 1000] (cm ?po). Those remaining in s’ were collected from 
international surveys such as HTRU. All of these pulsar 
samples are normal pulsars. According to Section 3.1, all the 
new candidates in FAST data (containing the above known 
sample set s’ ) were formed by the DCA fusion of the 1D 
feature arrays and related 2D TVP arrays, which were extracted 
from corresponding pfd files by the feature extraction programs 
of HTRU2 and PICS. In addition, the known sample set s’ 
(1600 real pulsar samples) and 10 000 non-pulsar samples were 
used for DCA algorithm training. Next, the sliding window size 
was also set to Batchsize=2 and the unit-size was 1251, then 
the original data set was divided into (gi, 25, ***, 8125, £126). 
where 21, 22, :::, 8125 = 1251 but giog = 1241. As a result, 
the experimental data set consisted of 126 data blocks, 
i.e., Block(1):[s', g,, g,], Block(2):[s’, 8, g3],-..,. Block(125): 
[s’, 255. 81261, Block(126):[s', 9156, 9]. Note that the original 
78 pulsar samples from the FAST data were randomly 
distributed in (gi, 82, :::, 2125, 8126). The subsequent 
clustering process is the same as that on HTRU2, refer to 
Section 3.2. The specific parameters 
Othrehold = 0.023 and Poutlier = 0.00051. 

Our algorithm has been experimented on by using a Linux 
cluster environment with seven physical computing nodes 
with seven Intel 6230 Xeon @ 2.1 GHz CPUs with 480 CPU 
cores (four Nvidia-GeForce-RTX-2080Ti, 5.3TB of total 
RAM, 3.6PB of total disk space). The system is running 
Linux 3.10.0-862.e17.x86. 64, Python 3.8, Tensorflow 2, and 
MPl4py. The experimental results demonstrate that the 
highest number of pulsars identified in a round by the 
parallel mode of FMFHC achieves 76 of 78, with an average 
of 75 (Recall of 96.1%, Precision of 89.1% and F1-Score of 
92.7%), compared with the PICS (Recall of 95%) and 


Pthrehold = 9.3, 


0 http: / /groups.bao.ac.cn/ism/CRAFTS /CRAFTS / 
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Figure 8. The clustering effect of FMFHC on FAST data. Note that the input data after multi-modal feature fusion are transformed into the arrays with two features X 
and Y. 
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Figure 9. Average running time of parallel FMFHC based on MPI. 


PICS-ResNet (Recall of 98%) in mentioned literature (Wang Table 6 
et al. 2019). Figure 8 shows the clustering effect of FMFHC Failure Modes and Codes 
on FAST data. Code ^ How They Fail Failure Cause Existence —— 

In addition, the running time data were collected under a Yes No 
different number of parallel nodes to further verify the time 301 Isolated points Classification criteria ————————— V. 
complexity of FMFHC. Note that the entire running time of 302 Collinearity Signal feature selection s 
FMFHC refers to the sum of execution time of data import, 303 Data standardization Dimension of indicators d 
CAE based feature extraction, DCA based feature fusion and ai Pattern similarity Data ies a and ul 
hybrid clustering, excluding the training time for the CAE and 305 Sein E nad between / 
DCA models. Figure 9 depicts the average running time of ples clustering true pulsars and noise 
parallel FMFHC with a different number of CPU cores. As can candidates 
be seen from the figure, the average running time decreases 306 Fast clustering of — — Cluster center selection á 

large sample data and manual setting of 


with the increment of parallel nodes within limits. When the 
number of cores reaches 36, it drops to 90.85 s. 


cluster numbers 
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After analysis, it is concluded that the non-pulsar based 
transients in FAST data could be roughly classified into other 
cosmic source signals such as Fast Radio Bursts (FRBs), RFI, 
low DM or narrowband signals, and very weak signals, and 
they can be identified by the signal shapes and signal features. 
Most of the non-pulsar signals incorrectly identified as pulsars 
here are broadband RFI. Moreover, the failure modes of 
clustering methods include the following six categories, where 
pattern similarity measure and fast clustering of large sample 
data still exist for FMFHC, as shown in Table 6. 


5. Discussion 


The overall performance analysis of FMFHC includes two 
parts: classification performance testing and running time 
evaluation. On HTRU2, FMFHC can ensure the clustering 
effect (the highest Precision, Recall and F1-Score) which looks 
better than other single-modal pulsar candidate classification 
methods in mentioned literature. On the FAST experimental 
data collected from CRAFTS, it still performs well (Precision 
and Recall) compared with the PICS and PICS-ResNet, and its 
parallel mode significantly reduces the execution time while 
ensuring the classification performance. It should be explained 
that HTRU2 is a public data set specifically designed to test the 
performance of pulsar candidate classifiers, where each data 
sample is carefully selected. However, actual FAST observa- 
tion data are more diverse and complex in terms of data 
distribution than HTRU2 data. So, the performance of FMFHC 
on the CRAFTS database of FAST does not appear as excellent 
as that of HTRU2. Consequently, we can believe that FMFHC 
is effective for high-volume pulsar candidate data streams in 
actual scenarios. (1) A pulsar candidate data stream, regardless 
of its capacity, can be divided into fixed size blocks for parallel 
processing. (ii) It will further promote the discovery of outliers 
by clustering more meaningful classifications. (iii) It still could 
be improved with the optimization of data partition strategy and 
relevant parameters. In brief, our algorithm has provided a 
good theoretical and practical reference for sifting large 
numbers of pulsar candidate signals obtained from the FAST 
survey. 


6. Conclusion 


A multi-modal hybrid clustering method named FMFHC is 
presented for large numbers of pulsar candidates in this paper, 
whose contributions are summarized as follows: 


(i) A feature-level fusion scheme based on the DCA 
algorithm is applied to maximize the separation between 
pulsar and non-pulsar candidates for large amounts of 
pulsar candidate data. 

(ii) A combination of the multi-density peak identification 
scheme using a mixed kernel function for density 
computation and extension of cluster center iterative 
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optimization scheme of K-Means is adopted to improve 
the clustering effect for data distribution with multiple 
shapes and screen out the outliers which could be special 
pulsars. 

(iii) The semi-supervised learning mode without a large 
number of training samples and sliding window based 
data partition strategy are adopted to enhance the 
efficiency of the overall algorithm and reduce the 
execution time. 


FMFHC is proved to be feasible, but there are still the 
following shortcomings: (1) Enough real data are still needed to 
further validate our algorithm. (ii) Due to limitations in 
experimental conditions, the actual performance comparison 
between our algorithm and other advanced parallel algorithms 
in recent years has not been conducted in an MPI experimental 
environment. In the future, we plan to connect the proposed 
algorithm to the pulsar distributed search pipeline based on 
PRESTO for application test and improvement. The program 
codes of FMFHC were uploaded on website. ' 
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